NetNews Offline 2

home *** CD-ROM | disk | FTP | other *** search

/ NetNews Offline 2 / NetNews Offline Volume 2.iso / news / comp / lang / c-part2 / 15118 < prev next >

Wrap

Internet Message Format | 1996-08-05 | 1.7 KB

Path: beta.nedernet.nl!usenet From: jos@and.nl (Jos A. Horsmeier) Newsgroups: comp.misc,comp.unix.programmer,comp.lang.c,comp.programming Subject: Re: Full Text Search Algorithms Date: 17 Apr 1996 10:19:47 GMT Organization: AND Operations Research B.V. Message-ID: <4l2gk3$lgd@beta.nedernet.nl> References: <3174436E.1A36@flashnet.it> NNTP-Posting-Host: klepzeiker.and.nl Mime-Version: 1.0 Content-Type: Text/Plain; charset=ISO-8859-1 X-Newsreader: WinVN 0.99.5 In article <3174436E.1A36@flashnet.it>, intarget@flashnet.it wrote: |I need to build a full text search engine which can handle a large amount |of documents (about 300,000 plain text documents 3Kbytes long). |I tried with by implementing inverted lists of words with BTrees indexes, |but the resulting software becomes too slow after the first 20,000 |documents. In addition, there's too much wasted space in the indexes. | |Can anybody suggest me a better algorithm and, if possible, tell me where |can I find a technical description of it (possibly on the NET)? Maybe the following approach helps you out: - construct a list of unique words such that if a word is an element of this list, then this word occurs at least once in at least one document. - for each word in the list above, construct a bit list such that if a bit x in this list is set, then the word occurs in document number x. Words like 'the', 'and' etc. most likely have all their bits set in their bit lists and may be removed from the first list. Further reduction, i.e. compression of the bit lists, greatly reduces the total size of the database. Simple tricks like run length encoding come to mind ... kind regards, Jos aka jos@and.nl -- Atnwgqkrl gy zit vgksr, ug qshiqwtzoeqs!